ABSTRACT

We describe recent extensions to our previous work, in which we
explored the use of individual classifiers, namely boosting and
maximum entropy models, for sentence segmentation. In this paper we
extend the set of classification methods with support vector machines
(SVMs). We propose a new dynamic entropy-based classifier combination
approach to combine these classifiers, and compare it with
traditional classifier combination techniques, namely voting, linear
regression, and logistic regression. Furthermore, we investigate
the combination of hidden event language models with the output
of the proposed classifier combination and with the output of the
individual classifiers. Experimental studies conducted on the Mandarin
TDT4 broadcast news database show that the SVM classifier, as an
individual classifier, improves over our previous best system. However,
the proposed entropy-based classifier combination approach shows the
best improvement in F-measure, 1% absolute, and the voting approach
shows the best reduction in NIST error rate, 2.7% absolute,
when compared to the previous best system.

Index  Terms —  sentence  segmentation,  classifier  combination, 
entropy,  lexical  and  prosodic  features,  hidden  event  language  model 

1.  INTRODUCTION 

Sentence segmentation aims to enrich the unstructured word sequence
output of automatic speech recognition (ASR) systems with sentence
boundaries, to ease further processing by both humans and machines.
For instance, the enriched output of ASR (i.e., with sentence
boundaries marked) can be used for later processing such as machine
translation, question answering, or story/topic segmentation.

In the literature, the task of detecting sentence boundaries is seen
as a two-class classification problem, and different classifiers have
been investigated. [1] and [2] use a method that combines hidden
Markov models (HMM) with N-gram language models containing
words and sentence boundary tags associated with them [3]. This
method was later extended with confusion networks in [4]. [5]
provides an overview of different classification algorithms (boosting,
hidden-event language models, maximum entropy models and decision
trees) applied to this task for multilingual broadcast news.
Besides the type of classifier, the use of different features has been
widely studied. [2, 6] showed how prosodic features can benefit the
sentence segmentation task. Investigations on prosodic and lexical
features in the context of telephone conversations and broadcast
news speech are also presented in [6, 5]. More recently, the use of
syntactic features has been studied in [7].

In this paper, we extend our previous work on sentence segmentation
for broadcast news in the framework of the DARPA GALE
program [5], where different classifiers, namely decision trees,
boosting, and maximum entropy models, were investigated. It was found
that boosting was the best individual classifier. Furthermore, it was
shown that performance can be further improved by combining the
individual classifier probabilities with the probabilities estimated
from a hidden event language model (HELM).

While our previous work focused on individual classifiers and
their combination with HELM, this paper focuses on combining the
individual classifier outputs to improve the performance of sentence
segmentation. More specifically, we study the use of the entropy-based
classifier combination approach, which has been successfully
used for ASR [8]. The entropy-based approach dynamically estimates
the weight for each classifier based on its instantaneous output
probabilities for each example and combines the output probabilities
of the different classifiers before making the final decision.
We compare this approach to a standard classifier decision combination
approach, voting, and to other static-weight-based classifier
combination techniques such as linear regression and logistic regression.
The predicted advantage of the entropy-based approach over linear
regression and logistic regression is that it can generalize well to
different data sets. Our results on the Mandarin TDT4 broadcast news
database validate this prediction. The classifiers used in our studies
are boosting, maximum entropy models, and SVM. Finally, we
also study the combination of individual classifier output probabilities,
as well as the combined probabilities resulting from the entropy-based
approach, with HELM probabilities. The newly proposed dynamic
classifier combination method for sentence segmentation further
improves performance. When we combine the outcome of classifier
combination with HELMs, we obtain the best performance of
55.2% NIST error rate, a 2.7% absolute reduction from the NIST
error rate of 57.9% obtained with the previous approach. With this
new method, the F-measure also improves from 70.6% to 71.6%.

The rest of the paper is organized as follows. Section 2 formulates
the sentence segmentation problem as a classification task,
and then describes the different classifiers and classifier combination
techniques that have been investigated. Section 3 describes our
experimental setup, and the results are presented in Section 4. Finally,
we conclude in Section 5.

2.  SENTENCE  SEGMENTATION 

Sentence segmentation can be considered as a binary boundary
classification problem with “sentence-boundary” (s) and
“non-sentence-boundary” (n) as classes [5]. For a given word sequence
$\{w_1, \ldots, w_N\}$, the goal is to estimate the classes for boundaries
$\{s_1, \ldots, s_N\}$, where $s_i$, $i = 1, \ldots, N$, is the boundary
between $w_i$ and $w_{i+1}$. Usually, this is done by training a classifier
to estimate the posterior probability $P(s_i = k \mid o_i)$, where
$k \in \{s, n\}$ and $o_i$ are the feature observations for the word
boundary $s_i$.

Ideally, the decision of the classifier is the class with maximum
probability $P(s_i = k \mid o_i)$. However, in a sentence segmentation task
the probability for sentence boundary, $P(s_i = s \mid o_i)$, is compared
against a threshold. If it is above the threshold, a decision of sentence
boundary is made; otherwise the boundary is marked as a
non-sentence-boundary. As will be seen later, the optimal threshold differs
for different classifiers and different evaluation metrics.
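The thresholded decision rule described above can be sketched as follows. This is a minimal illustration only: the function name `label_boundaries` and the example threshold values are hypothetical and do not come from the systems evaluated in this paper.

```python
from typing import List


def label_boundaries(posteriors: List[float], threshold: float) -> List[str]:
    """Map each boundary posterior P(s_i = s | o_i) to a class label.

    A boundary is labeled 's' (sentence boundary) when its posterior is
    above the threshold, and 'n' (non-sentence-boundary) otherwise.
    """
    return ["s" if p > threshold else "n" for p in posteriors]


# Illustrative posteriors for three word boundaries (made-up values).
labels = label_boundaries([0.9, 0.2, 0.45], threshold=0.4)
```

Lowering the threshold trades deletions for insertions, which is why the best operating point depends on the metric being optimized.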

2.1.  Classifiers 

In this work, we have used boosting, maximum entropy and SVM
classifiers to estimate $P(s_i \mid o_i)$.

Boosting is an iterative learning algorithm that aims to combine
“weak” base classifiers to come up with a “strong” classifier. At each
iteration, a weak classifier is learned so as to minimize the training
error, and a different distribution or weighting over the training
examples is used to give more emphasis to examples that are often
misclassified by the preceding weak classifiers. For this approach
we use the BoosTexter tool described in [9], which has the advantage
of discriminative training. Moreover, it can deal with a large set of
features, both discrete and continuous valued, and has the capability
to handle missing feature values.

Maximum entropy (MaxEnt) models are prominently used for
natural language processing [10]. The main advantages of MaxEnt
models are discriminative training, the capability to handle a large set
of features, flexibility in handling missing feature values, and
convergence to a uniquely defined global optimum. In standard MaxEnt
models, the features are discrete valued. However, the feature set
used in our studies contains continuous valued features in addition
to discrete valued features. How best to discretize continuous
valued features for MaxEnt models is an open research question. In this
work, we discretize the continuous valued features by binning them
into 10 classes, similar to [5]. We use the OpenNLP toolkit to train the
MaxEnt models [11].
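As an illustration of binning a continuous feature into 10 classes, the sketch below uses equal-frequency (decile) bins estimated on training data. The binning scheme is an assumption made for this example; the text above only states that 10 bins are used, similar to [5]. The function names are hypothetical.

```python
import bisect
import statistics
from typing import List, Sequence


def fit_bin_edges(train_values: Sequence[float], n_bins: int = 10) -> List[float]:
    """Return the n_bins - 1 interior cut points (quantiles) of the training values.

    Assumption: equal-frequency binning; other schemes (e.g. equal-width)
    would be equally compatible with the description in the text.
    """
    return statistics.quantiles(train_values, n=n_bins)


def to_bin_id(value: float, edges: Sequence[float]) -> int:
    """Map a continuous value to a discrete bin index in [0, len(edges)]."""
    return bisect.bisect_right(edges, value)


# Toy training feature: 100 evenly spread values.
train = [float(v) for v in range(100)]
edges = fit_bin_edges(train)
bin_ids = [to_bin_id(v, edges) for v in train]
```

The resulting bin index can then be treated as a discrete feature value by the MaxEnt trainer.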

SVMs are used in a wide range of pattern recognition applications
[12]. SVMs provide advantages similar to MaxEnt models in
terms of discriminative training and the capability to handle a large set
of features. In addition, SVMs can handle both discrete valued
and continuous valued features. In our studies, we normalize
the continuous valued features in the training data so as to have
zero mean and unit variance. In our data, there are instances where
a feature value may be missing (for the words that ASR outputs as
reject labels). In such a case, we introduce a new token for unknown
discrete valued features. For continuous valued features, we use the
mean value, that is, 0.0. When training the SVM classifier we found that
the choice of the kernel type is not obvious. When using mainly lexical
features (discrete valued), the best performance is achieved with a
linear kernel, whereas when using both lexical and prosodic features
(both discrete and continuous valued), a polynomial kernel
of degree 2 yields the best performance. For our studies, we use the
SVMlight toolkit [13].
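The normalization and missing-value handling for continuous features can be sketched as below. This is a simplified illustration under stated assumptions: missing values are represented as `None`, statistics are estimated on observed training values only, and the helper names are hypothetical rather than part of any toolkit used here.

```python
from statistics import fmean, pstdev
from typing import List, Optional, Sequence, Tuple


def fit_zscore(train_values: Sequence[Optional[float]]) -> Tuple[float, float]:
    """Estimate mean and standard deviation from observed training values."""
    observed = [v for v in train_values if v is not None]
    std = pstdev(observed)
    # Assumption: guard against constant features to avoid division by zero.
    return fmean(observed), (std if std > 0.0 else 1.0)


def apply_zscore(values: Sequence[Optional[float]], mean: float, std: float) -> List[float]:
    """Normalize to zero mean and unit variance; missing values become 0.0,
    which is the mean of the normalized feature."""
    return [0.0 if v is None else (v - mean) / std for v in values]


mean, std = fit_zscore([1.0, 3.0, None])
normalized = apply_zscore([1.0, None, 5.0], mean, std)
```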

2.2.  Classifier  Combination 

Combining classifiers is a well researched topic in machine learning
and spoken language processing [14, 15, among others]. Some of
the most common classifier combination methods used in the
literature include voting, and linear and logistic regression. The general
objective of classifier combination is to exploit the complementary
information between the classifiers. In a sense, the different
classifiers in a classifier combination can be seen as a collection of weak
classifiers, where each classifier can solve some different difficult
problems. The process of combination then involves combining the
decisions of the classifiers, or assigning a weight to each classifier’s
output evidence (in our case $P(s_i \mid o_i)$) and combining the evidence so
as to reduce the objective error. The weights can be estimated
statically, that is, a priori on held-out or development data (for
example, linear regression), or dynamically (for example, inverse entropy
combination). In this paper, we have investigated both.

Let $c \in \{1, \ldots, C\}$ and $P_c(s_i = k \mid o_i)$ denote the different
classifiers (in our case $C = 3$) and their output probabilities at instant
$i$. Given these, we investigate the following classifier combination
techniques:

•  Inverse  entropy 

•  Voting 

•  Linear  regression 

•  Logistic regression

2.2.1. Inverse Entropy

Given the instantaneous classifier output probabilities, the idea of
inverse entropy combination is to assign large weights to classifiers
that are more confident in their decisions and small weights to
classifiers that are less confident in their decisions [8]. The confidence is
measured in terms of the entropy $ent_c$ of the classifier output probabilities,
that is,

$$ent_c = \sum_{k=1}^{K} -P_c(s_i = k \mid o_i) \cdot \log_2\big(P_c(s_i = k \mid o_i)\big), \quad \forall c \qquad (1)$$

where $K$ is the number of output classes, which in our case is 2.

The weight $w_c$ for each classifier is then estimated as

$$w_c = \frac{\dfrac{1}{ent_c}}{\sum_{c'=1}^{C} \dfrac{1}{ent_{c'}}}, \quad \forall c \qquad (2)$$

Having estimated the weight for each classifier, the output
probabilities of the classifiers can be combined in two different ways:

1. Sum rule

$$P(s_i = k \mid o_i) = \sum_{c=1}^{C} w_c \cdot P_c(s_i = k \mid o_i), \quad \forall k \qquad (3)$$

2. Product rule

$$P(s_i = k \mid o_i) = \frac{1}{Z} \prod_{c=1}^{C} P_c(s_i = k \mid o_i)^{w_c}, \quad \forall k \qquad (4)$$

where $Z$ is a normalization factor.

The decision about the output class is then made based upon
$P(s_i = s \mid o_i)$.
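The inverse entropy weighting and the sum and product rules can be illustrated with the following sketch. It is not the implementation used in our experiments: the floor `EPS` (to avoid division by zero and zero probabilities) and the choice of $Z$ as the sum of the unnormalized products over classes are assumptions made for this example.

```python
import math
from typing import List, Sequence

EPS = 1e-12  # assumption: small floor for zero entropies and zero probabilities


def entropy(dist: Sequence[float]) -> float:
    """Entropy in bits of one classifier's output distribution, as in (1)."""
    return -sum(p * math.log2(p) for p in dist if p > 0.0)


def inverse_entropy_weights(dists: Sequence[Sequence[float]]) -> List[float]:
    """Per-example classifier weights proportional to 1 / entropy, as in (2)."""
    inv = [1.0 / max(entropy(d), EPS) for d in dists]
    total = sum(inv)
    return [v / total for v in inv]


def combine_sum(dists: Sequence[Sequence[float]]) -> List[float]:
    """Sum rule (3): weighted sum of the class probabilities."""
    w = inverse_entropy_weights(dists)
    return [sum(wc * d[k] for wc, d in zip(w, dists)) for k in range(len(dists[0]))]


def combine_product(dists: Sequence[Sequence[float]]) -> List[float]:
    """Product rule (4): weighted geometric combination, renormalized by Z."""
    w = inverse_entropy_weights(dists)
    unnorm = [
        math.prod(max(d[k], EPS) ** wc for wc, d in zip(w, dists))
        for k in range(len(dists[0]))
    ]
    z = sum(unnorm)  # assumption: Z makes the combined probabilities sum to one
    return [u / z for u in unnorm]


# Three classifiers, each giving [P(s), P(n)] for one boundary (made-up values).
example = [[0.9, 0.1], [0.5, 0.5], [0.6, 0.4]]
weights = inverse_entropy_weights(example)
p_sum = combine_sum(example)
p_prod = combine_product(example)
```

In this toy example the first classifier is the most confident, so it receives the largest weight, while the classifier with a uniform output receives the smallest.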

2.2.2.  Voting 

The voting technique looks for agreement between the classifiers’ output
decisions. Typically, each classifier votes for one output class based
on its decision. The final decision, or output class, is the class
receiving the maximum number of votes.

In our case, all classifiers have one vote and each classifier $c$
decides its vote by comparing $P_c(s_i = s \mid o_i)$ to the threshold.
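A minimal sketch of this voting scheme follows. Passing one threshold per classifier is an assumption of the example (the text only refers to "the threshold"), and the function name is hypothetical.

```python
from typing import Sequence


def majority_vote(boundary_probs: Sequence[float], thresholds: Sequence[float]) -> str:
    """Return 's' if more than half of the classifiers vote for a sentence
    boundary, and 'n' otherwise.

    boundary_probs[c] is P_c(s_i = s | o_i); thresholds[c] is the decision
    threshold of classifier c (assumed here to be set per classifier).
    """
    votes_for_boundary = sum(1 for p, t in zip(boundary_probs, thresholds) if p > t)
    return "s" if votes_for_boundary > len(boundary_probs) / 2 else "n"


decision_a = majority_vote([0.7, 0.4, 0.6], [0.5, 0.5, 0.5])
decision_b = majority_vote([0.7, 0.2, 0.3], [0.5, 0.5, 0.5])
```

With an odd number of classifiers and two classes, as here with three classifiers, ties cannot occur.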

2.2.3.  Linear  Regression 

In a linear regression classifier combination, the combined probability
for the sentence boundary class, $P(s_i = s \mid o_i)$, is estimated as the
linearly weighted sum of the classifier output probabilities for the sentence
boundary class, $P_c(s_i = s \mid o_i)$:

$$P(s_i = s \mid o_i) = \alpha + \sum_{c=1}^{C} w_c \cdot P_c(s_i = s \mid o_i) \qquad (5)$$

where $\alpha$ is a bias term. The bias $\alpha$ and the weights $w_c$ are optimized
on development data, and are frozen for the test data. In our studies,
we use the MATLAB function regress to estimate the weights and
constant.
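For readers without MATLAB, an ordinary least-squares fit of the bias and weights can be sketched in plain Python as below. This is a stand-in for illustration, not the regress function itself: it solves the normal equations by Gaussian elimination, assumes the design matrix has full column rank, and uses hypothetical function names and made-up data.

```python
from typing import List, Sequence


def fit_linear_combiner(X: Sequence[Sequence[float]], y: Sequence[float]) -> List[float]:
    """Least-squares estimate of [alpha, w_1, ..., w_C].

    X holds one row of classifier boundary probabilities per development
    example; y holds the reference labels (1.0 = boundary, 0.0 = no boundary).
    """
    rows = [[1.0] + list(r) for r in X]  # prepend 1.0 for the bias term
    n = len(rows[0])
    # Normal equations: (R^T R) beta = R^T y
    A = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    b = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(n)]
    # Gaussian elimination with partial pivoting (assumes a non-singular system).
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for k in range(col, n):
                A[r][k] -= f * A[col][k]
            b[r] -= f * b[col]
    # Back substitution.
    beta = [0.0] * n
    for i in reversed(range(n)):
        beta[i] = (b[i] - sum(A[i][j] * beta[j] for j in range(i + 1, n))) / A[i][i]
    return beta


def combine_linear(beta: Sequence[float], probs: Sequence[float]) -> float:
    """Apply (5): alpha plus the weighted sum of classifier probabilities."""
    return beta[0] + sum(w * p for w, p in zip(beta[1:], probs))


# Synthetic check: targets generated from known coefficients are recovered.
X_demo = [[0.1, 0.9], [0.8, 0.2], [0.5, 0.5], [0.3, 0.1], [0.9, 0.7]]
y_demo = [0.1 + 0.5 * a + 0.3 * c for a, c in X_demo]
beta_demo = fit_linear_combiner(X_demo, y_demo)
```

Note that the output of (5) is not constrained to lie in [0, 1], so it is best read as a score to be thresholded rather than as a calibrated probability.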


2.2.4. Logistic Regression

The classifier output probabilities can be combined using logistic
regression by treating them as predictor variables of the logistic
function, and estimating the combined probability for the sentence
boundary class, $P(s_i = s \mid o_i)$, as

$$P(s_i = s \mid o_i) = \frac{1}{1 + \exp\left(-\left(\alpha + \sum_{c=1}^{C} w_c \cdot P_c(s_i = s \mid o_i)\right)\right)} \qquad (6)$$

where the bias $\alpha$ and the weights $w_c$ for each classifier $c$ are estimated
on development data, as in linear regression. In our work, we
estimate the optimal $\alpha$ and $w_c$ on the development data using the
Newton-Raphson method [16].

3. EXPERIMENTAL SETUP

3.1. Data

To evaluate our approach, we have used a subset of the TDT4
Mandarin Broadcast News corpus. Table 1 lists the properties of the
training, development and test sets, which are selected in time order
(that is, all shows in the training set precede the development and
test set shows). The sentence segmentation experiments have been
performed on the ASR output that was used in our earlier work [5].

                          Training       Dev      Test
No. of shows                   131        17        17
No. of sentence units       19,643     2,903     2,991
No. of examples            481,419    72,152    77,915

Table 1. Training, Development (Dev) and Test data sets.

3.2. Feature sets

To study the use of prosodic features, we use two feature sets, namely,
WP and ALL:

•  WP: includes only the words around the boundary, pause
duration, and their N-grams

•  ALL: includes WP, prosodic features, speaker turn, and
part-of-speech (POS) tag N-grams

In addition to the word-based prosodic features related to pitch,
pitch slope, energy and pause duration between words that
were used in our earlier work [5], we also use speaker-normalized
versions of the pitch and energy features. We used the ICSI-SRI
speaker diarization system to divide the audio data into
hypothetical speakers [17]. The baseline and ranges for the pitch
and energy features were estimated for each speaker and
were then used as parameters in the normalization. Further,
the prosodic features also include turn-based features, which
describe the position of a word in relation to the diarization
segmentation. The speaker turn features were extracted from the
diarization segmentation. We use the POS tag features that
were used in [5].

3.3. Evaluation Metrics

We evaluate our systems in terms of F-measure (FM) and NIST
error rate (NIST) [18]. If $I$ is the number of insertions (i.e., false
positives), $D$ is the number of deletions (i.e., false negatives), and $C$
is the number of sentence boundaries correctly recognized, then:

$$\text{precision} = \frac{C}{C + I}, \qquad \text{recall} = \frac{C}{C + D}$$

$$\text{FM} = \frac{2 \cdot \text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}} \qquad (7)$$

$$\text{NIST} = \frac{I + D}{C + D} \qquad (8)$$

Note that, unlike FM, NIST does not have an upper bound.

4. RESULTS


Tables 2 and 3 show sentence segmentation performance with the
individual classifiers and with the various classifier combination methods
for the development and test sets, respectively, using both the WP
and ALL feature sets. In all the tests, we use the thresholds optimized
on the development set to compute the NIST error rate and F-measure
on the test set. With the ALL features, out of the three classifiers,
SVM performs the best on the test set, with 57.6% NIST error rate and
69.8% F-measure. The inverse entropy (IE) classifier combination method
results in the best F-measure of 70.9% on the test set, and the voting
method results in the lowest NIST error rate of 56.6%. The
performance difference between the voting and IE methods and the
linear and logistic regression methods on the test set is statistically
significant,¹ with the first two performing better. The MaxEnt classifier
performs the best on the WP feature set, but performs worse than
boosting and SVM on the ALL feature set. This may be
primarily due to the way the continuous valued features
are discretized.
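To make the evaluation procedure concrete, the sketch below computes the F-measure and NIST error rate defined in Section 3.3 from insertion, deletion, and correct counts, and selects a threshold on development data for a chosen metric. The threshold grid, the function names, and the toy data are assumptions of this example, and it presumes that the reference contains at least one sentence boundary.

```python
from typing import List, Sequence, Tuple


def count_errors(probs: Sequence[float], refs: Sequence[bool], threshold: float) -> Tuple[int, int, int]:
    """Return (C, I, D): correct boundaries, insertions, and deletions."""
    C = I = D = 0
    for p, ref in zip(probs, refs):
        hyp = p > threshold
        if hyp and ref:
            C += 1
        elif hyp:
            I += 1
        elif ref:
            D += 1
    return C, I, D


def f_measure(C: int, I: int, D: int) -> float:
    """Harmonic mean of precision C/(C+I) and recall C/(C+D)."""
    if C == 0:
        return 0.0
    precision, recall = C / (C + I), C / (C + D)
    return 2 * precision * recall / (precision + recall)


def nist_error(C: int, I: int, D: int) -> float:
    """(I + D) divided by the number of reference boundaries (C + D)."""
    return (I + D) / (C + D)


def best_threshold(dev_probs: Sequence[float], dev_refs: Sequence[bool], metric: str = "nist") -> float:
    """Pick the threshold on a 0.01-spaced grid (an assumption) that is best
    on the development data: lowest NIST error or highest F-measure."""
    grid: List[float] = [i / 100 for i in range(1, 100)]
    if metric == "nist":
        return min(grid, key=lambda t: nist_error(*count_errors(dev_probs, dev_refs, t)))
    return max(grid, key=lambda t: f_measure(*count_errors(dev_probs, dev_refs, t)))


# Toy development data (made-up values).
dev_probs = [0.9, 0.8, 0.3, 0.6, 0.1]
dev_refs = [True, False, True, True, False]
t_nist = best_threshold(dev_probs, dev_refs, metric="nist")
```

The selected threshold is then held fixed when scoring the test set, which mirrors the tuning procedure described above.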


                    WP                ALL
               FM      NIST      FM      NIST
Boosting      68.0     64.9     72.6     54.3
MaxEnt        68.5     64.5     69.1     60.3
SVM           67.9     64.9     72.7     52.2
IE (sum)      69.4     61.8     73.6     51.8
IE (prod)     69.2     61.7     73.5     51.8
Voting        69.0     62.4     73.4     51.6
Linear        69.1     61.9     73.6     52.2
Logistic      69.6     61.4     73.9     51.9

Table 2. Sentence segmentation performance on the development
set for feature sets WP and ALL. The FM and NIST are expressed
in %. IE denotes inverse entropy, and sum and prod refer to the
sum and product combination rules in (3) and (4), respectively.
Linear denotes the linear regression technique, and Logistic denotes
the logistic regression technique. Note that the linear regression and
logistic regression weights were estimated on the development set.
The best performance for individual classifiers and classifier
combination is in boldface.

Table 4 shows the performance when we convert the posterior
probabilities from each classifier and classifier combination and use
them as state observation likelihoods in the hidden event language
model (HELM). The HELM used for this experiment is a 4-gram
model and is trained using all the text in the training set, as well as
other data. We achieve the best NIST error rate of 55.2% on the test set
when the combined probabilities are estimated as the class probabilities
averaged across the classifiers supporting the decision of the voting method.
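One possible reading of this averaged-probability scheme is sketched below: the majority decision is found first, and the boundary probabilities of only those classifiers that agree with it are averaged. This interpretation, the per-classifier thresholds, and the function name are assumptions of the example; the exact procedure is not spelled out in the text.

```python
from typing import Sequence


def avg_prob_supporting_vote(boundary_probs: Sequence[float], thresholds: Sequence[float]) -> float:
    """Average P_c(s_i = s | o_i) over the classifiers whose individual
    decision matches the majority vote (assumed interpretation)."""
    votes = [p > t for p, t in zip(boundary_probs, thresholds)]
    majority = sum(votes) > len(votes) / 2
    supporters = [p for p, v in zip(boundary_probs, votes) if v == majority]
    return sum(supporters) / len(supporters)


# Made-up probabilities from three classifiers for two boundaries.
avg_boundary = avg_prob_supporting_vote([0.7, 0.4, 0.6], [0.5, 0.5, 0.5])
avg_non_boundary = avg_prob_supporting_vote([0.7, 0.2, 0.3], [0.5, 0.5, 0.5])
```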

                    WP                ALL
               FM      NIST      FM      NIST
Boosting      64.4     71.5     68.9     61.3
MaxEnt        65.7     70.8     67.3     63.5
SVM           64.7     71.9     69.8     57.6
IE (sum)      66.5     68.2     70.7     57.6
IE (prod)     66.1     68.6     70.9     57.3
Voting        66.4     68.7     70.6     56.6
Linear        64.1     70.2     67.4     60.0
Logistic      64.1     69.7     68.6     59.6

Table 3. Sentence segmentation performance on the test set for
feature sets WP and ALL. The FM and NIST are expressed in %. IE
denotes inverse entropy, and sum and prod refer to the sum and
product combination rules in (3) and (4), respectively. Linear denotes the
linear regression technique, and Logistic denotes the logistic
regression technique. The linear regression and logistic regression weights
are estimated on the development set. The best performance for
individual classifiers and classifier combination is in boldface.

                         Dev               Test
                    FM      NIST      FM      NIST
Boosting+HELM      74.1     50.9     70.6     57.9
MaxEnt+HELM        71.5     55.0     69.2     61.1
SVM+HELM           75.5     48.9     71.9     56.7
IE (sum)+HELM      74.3     49.5     71.3     56.3
IE (prod)+HELM     74.5     49.5     71.6     56.1
Avg. Prob+HELM     74.2     49.5     70.8     55.2

Table 4. Results of sentence segmentation studies with HELM for
feature set ALL on the development (Dev) and test sets. The FM and
NIST are expressed in %. IE (sum): the combined classifier output
probabilities are estimated using (3). IE (prod): the combined
classifier output probabilities are estimated using (4). Avg. Prob: class
probabilities averaged across the classifiers supporting the decision
of the voting method. The best performance for individual classifiers
and classifier combination is in boldface.

5. CONCLUSIONS

We have described the recent extensions to our previous work on
sentence segmentation, where we extend the set of classification
methods with SVMs and combine classification methods with a new,
dynamic, entropy-based classifier combination method, which was
already shown to be useful for ASR. We show improvements when
we use an extended set of prosodic features in addition to pause
duration with all classifiers. Furthermore, to model the sequence
information, we incorporate the combined classifier output with hidden
event language models. We show statistically significant
improvements for both NIST error rate and F-measure.

¹ According to the Z-test with a 95% confidence interval.